ABSTRACT

The Support Vector Machine provides a new way to design classification algorithms which learn from examples
(supervised learning) and generalize when applied to new data. We demonstrate its success on a difficult classification
problem from hyperspectral remote sensing, where we obtain performances of 96% and 87% correct for a 4 class
problem and a 16 class problem respectively. These results are somewhat better than other recent results on the same
data. A key feature of this classifier is its ability to use high-dimensional data without the usual recourse to a feature
selection step to reduce the dimensionality of the data. For this application, this is important, as hyperspectral data
consists of several hundred contiguous spectral channels for each exemplar. We provide an introduction to this new
approach, and demonstrate its application to classification of an agriculture scene.

1. INTRODUCTION

The Support Vector Machine (SVM) is a relatively recent approach introduced by Boser, Guyon, and Vapnik [1, 2]
for solving supervised classification and regression problems, or more colloquially, learning from examples. In the
following we will discuss only classification.

This work is in part motivated by the recent profusion of high-dimensional data in remote sensing, where
hyperspectral imaging sensors for research [3] or commercial use [4, 5] measure radiance at hundreds of contiguous
channels for each ground pixel. For this data, part of the challenge is to find classifiers that perform well in such
high-dimensional spaces.

Traditionally, classifiers model the underlying density of the various classes and then find a separating surface.
However, density estimation in high-dimensional spaces suffers from the Hughes effect [6, 7]: for a fixed amount of
training data, the classification accuracy as a function of the number of bands reaches a maximum and then declines,
because there is only a limited amount of training data to estimate the large number of parameters needed. Thus
a feature selection step is usually performed first on the high-dimensional data to reduce its dimensionality.

As we will demonstrate, the SVM approach does not suffer this limitation and uses the full dimensionality of the
hyperspectral data. Support Vector Machines directly seek a separating surface through an optimization procedure
that finds the exemplars that form the boundaries of the classes. These exemplars are called the support vectors.
This is significant because usually only a small subset of the training data is involved in defining the separating
surface, i.e., those examples that are closest to the separating surface.

In addition, the Support Vector Machine approach uses the kernel method, discussed below, to map the data
with a non-linear transformation to a higher dimensional space and in that space tries to find a linear separating
surface between the two classes. The transformation to a higher dimensional space tends to spread the data out in
a way that facilitates the finding of a linear separating surface. In this way the separating surfaces that would be
non-linear (not a hyperplane) in the original data space can become linear (a hyperplane) in the higher dimensional
space. Instead of being penalized by the curse of dimensionality and its attendant Hughes effect, the Support Vector
Machine can use the full dimensionality of the hyperspectral data without the feature selection preprocessing step.
Why the curse of dimensionality is not a problem for the kernel method is discussed below.

Further author information: (Send correspondence to J. A. Gualtieri)

J.A. Gualtieri: E-mail: gualt@peep.gsfc.nasa.gov
R.F. Cromp: E-mail: cromp@sauquoit.gsfc.nasa.gov


A number of useful introductions are available in publications and on the world wide web [8, 9, 10, 11]. In what
follows we will first focus on binary classification: in the class or not in the class. Subsequently we will handle
multiple classes by building a separate classifier for each pair of classes and follow this with a voting strategy to choose
the winning class.


The plan for the paper is to give an overview of the mathematical formulation for binary classification. Then
we introduce the optimal margin hyperplane, the transformation of its resulting optimization problem by means of
Lagrange multipliers, and its solution. This is done for both the separable and non-separable cases. We then discuss
the kernel method and the generalization to multiple classes. Following this, a section is devoted to describing the
hyperspectral data we have used for demonstrating the classifier. Then we discuss implementation details and present
the results. The conclusion summarizes the results and suggests further development of the method.


2. MATHEMATICAL FORMULATION 

2.1. Classification 

For classification, a set of examples consisting of pairs of class labels and feature vectors is known, and we wish to
find a classifier function that gives correct answers on these examples and has low generalization error, meaning it
gives good results for the class labels when applied to feature vector inputs it has not seen before.

We are given $l$ training pairs, $(y_i, \mathbf{x}_i)$, $i = 1, \ldots, l$, consisting of class labels, $y_i \in \{1, -1\}$, and
$n$-dimensional feature vectors, $\mathbf{x}_i \in \mathbb{R}^n$. We wish to find a function $f(\cdot\,; \alpha) : \mathbf{x} \mapsto y$ that represents the
classifier $y = f(\mathbf{x}; \alpha)$, where $\alpha$ are all the parameters of the classifier.

2.2. Optimal Margin Method for Separable Data 



Figure 1. Schematic of separable data in $\mathbb{R}^2$. The circles are feature vectors in class +1 and the diamonds are
feature vectors in class -1. The placement of the hyperplane shown is optimal.


Vapnik and Chervonenkis [12] and Vapnik [13] originated the optimal margin method for separable data. With
reference to Fig. 1, the problem is how to place a hyperplane such that:

1. All data belonging to class +1 lies on one side of the hyperplane and all data belonging to class -1 lies on the
other side.

2. The hyperplane is placed so that the closest vectors in both classes are as far as they can be from the
hyperplane.


The hyperplane is defined by the equation

$$\mathbf{w} \cdot \mathbf{x} + b = 0, \tag{1}$$

where $\mathbf{x}$ is a point on the hyperplane, $\mathbf{w}$ is the $n$-dimensional vector perpendicular to the hyperplane, and
$|b| / |\mathbf{w}|$ is the distance from the origin to the closest point on the hyperplane. The classifier is then

$$f(\mathbf{x}; \mathbf{w}, b) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b). \tag{2}$$

The orientation of the hyperplane is chosen so that $\mathbf{w}$ points towards the class labeled with +1. Let $d_i$ be the
perpendicular distance of vector $\mathbf{x}_i$ from the hyperplane, measured from any point $\mathbf{x}$ on the hyperplane,

$$d_i = y_i \, \frac{\mathbf{w}}{|\mathbf{w}|} \cdot (\mathbf{x}_i - \mathbf{x}). \tag{3}$$

By pre-multiplying by $y_i$ we guarantee that all the $d_i$ are positive. Using the hyperplane equation, Eq. (1), to
eliminate $\mathbf{x}$ in Eq. (3) we obtain

$$d_i = y_i \, \frac{\mathbf{w} \cdot \mathbf{x}_i + b}{|\mathbf{w}|}. \tag{4}$$
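As an illustration of Eqs. (2) and (4), the following minimal sketch evaluates the classifier and the signed distance for a hand-picked hyperplane. The function names and the example values of `w` and `b` are hypothetical, and the convention that sgn(0) is treated as +1 is an assumption made only for this example.

```python
import math


def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def classify(w, b, x):
    """Eq. (2): sgn(w . x + b); a value of exactly 0 is mapped to +1 (assumption)."""
    return 1 if dot(w, x) + b >= 0 else -1


def signed_distance(w, b, x, y):
    """Eq. (4): y (w . x + b) / |w|, positive when x lies on the side named by y."""
    return y * (dot(w, x) + b) / math.sqrt(dot(w, w))


# Hypothetical hyperplane 3 x1 + 4 x2 - 5 = 0, so |w| = 5.
w, b = [3.0, 4.0], -5.0
print(classify(w, b, [3.0, 4.0]))             # label of a point on the +1 side
print(signed_distance(w, b, [3.0, 4.0], 1))   # (9 + 16 - 5) / 5
print(signed_distance(w, b, [0.0, 0.0], -1))  # origin, labeled -1
```

The distance of the origin from this hyperplane comes out as $|b| / |\mathbf{w}| = 1$, in agreement with the geometric meaning of $b$ stated after Eq. (1).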


We may then pose the problem as: minimize, over all the training vectors, the distance of the hyperplane from the
training vectors, and then maximize that minimum distance over all placements of the hyperplane:

$$\max_{\mathbf{w},\, b} \; \min_{i = 1, \ldots, l} \; y_i \, \frac{\mathbf{w} \cdot \mathbf{x}_i + b}{|\mathbf{w}|}. \tag{5}$$


The particular vectors that are found to be nearest the hyperplane are called support vectors and are the central
result of the approach. We note that the parameters describing the hyperplane, $\mathbf{w}$ and $b$, can be scaled by a
positive constant without changing the hyperplane. To remove this ambiguity we choose a canonical form of the
hyperplane by scaling $\mathbf{w}$ and $b$ such that

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \;
\begin{cases}
= 0 & \text{if } i \text{ is a support vector} \\
> 0 & \text{if } i \text{ is not a support vector,}
\end{cases}$$

or

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \qquad i = 1, \ldots, l. \tag{6}$$

With this normalization the distance of the hyperplane to the nearest feature vector is $|\mathbf{w}|^{-1}$. When used in Eq. (5)
we obtain

$$\max_{\mathbf{w},\, b} \; |\mathbf{w}|^{-1} \qquad \text{subject to} \qquad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \quad i = 1, \ldots, l.$$

It is convenient to replace maximization of $|\mathbf{w}|^{-1}$ with the equivalent minimization of $\frac{1}{2} |\mathbf{w}|^2$, where the factor
of $\frac{1}{2}$ is cosmetic.

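In summary we have: To find the optimal hyperplane for separable data, solve the Quadratic Optimization
problem given by

$$\min_{\mathbf{w},\, b} \; \frac{1}{2} |\mathbf{w}|^2 \qquad \text{subject to} \qquad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \quad i = 1, \ldots, l. \tag{7}$$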
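The canonical scaling of Eq. (6) and the resulting margin $|\mathbf{w}|^{-1}$ can be checked numerically. The sketch below is only an illustration on a hypothetical four-point data set; `canonical_form` and `margin` are names introduced here, and the starting hyperplane is assumed to already separate the data.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def canonical_form(w, b, X, y):
    """Rescale (w, b) so that min_i y_i (w . x_i + b) = 1, as required by Eq. (6)."""
    m = min(yi * (dot(w, xi) + b) for xi, yi in zip(X, y))
    if m <= 0:
        raise ValueError("the hyperplane does not separate the data")
    return [wk / m for wk in w], b / m


def margin(w):
    """Distance from the canonical hyperplane to the nearest feature vector, 1/|w|."""
    return 1.0 / math.sqrt(dot(w, w))


# Hypothetical separable data in R^2.
X = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -2.0]]
y = [1, 1, -1, -1]

w, b = canonical_form([4.0, 4.0], 0.0, X, y)
support = [i for i, (xi, yi) in enumerate(zip(X, y)) if yi * (dot(w, xi) + b) == 1.0]
print(w, b, margin(w), support)
```

For this data the rescaled hyperplane has `w = [0.5, 0.5]`, and the two vectors that meet Eq. (6) with equality are the ones closest to the hyperplane. Note that rescaling a given separating hyperplane does not by itself solve Eq. (7); it only puts that hyperplane into canonical form.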



2.3. Lagrange Undetermined Multipliers 

The quadratic optimization problem in Eq. (7) can be simplified so as to replace the inequalities with a simpler form
by transforming the problem to a dual space representation using Lagrangian multipliers. The motivation for doing
this comes from a method invented by Lagrange in mechanics. We give a short digression to motivate the subsequent
transformations.


Suppose we have a potential energy function of dynamical variables, where the dynamical variables are restricted
to certain ranges. For example, a mass, $m$, which can move only vertically in a gravitational field $g$ is attached to
a string of length $l$. The mass has potential energy $mgz$, but there are also the constraints $z \le l$ and $z \ge -l$. We
desire to find the equilibrium position (minimum energy) while taking the constraints into account. Lagrange suggested
adding additional terms to the original energy function that represent the forces that come into existence only when
a dynamical variable cannot change freely because the constraint extreme is reached (the inequality becomes an
equality). In our example the mass $m$ feels the force of gravity and a constraint force when $z = -l$ or $z = l$. We can
either explicitly restrict the range of values to $-l \le z \le l$, or allow $z$ to vary over $(-\infty, \infty)$ and put into the potential
terms which apply constraint forces which are zero unless $z \le -l$ or $z \ge l$. In this way the constraints are built into
the energy function and the dynamical variables are no longer subjected to explicit constraints, though now we must
solve for the additional forces as part of the problem.


2.4. The Dual Optimization Problem for Separable Data 

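With this as motivation we absorb the constraints into the minimization problem by defining Lagrange undetermined
multipliers (the "constraint forces" discussed above),

$$\lambda_i \ge 0, \qquad i = 1, \ldots, l. \tag{8}$$

Defining our extended potential to be

$$\mathcal{L}(\mathbf{w}, b, \lambda_1, \ldots, \lambda_l) = \frac{|\mathbf{w}|^2}{2} - \sum_{i=1}^{l} \lambda_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right], \tag{9}$$

the optimization problem becomes

$$\max_{\lambda_1, \ldots, \lambda_l} \; \min_{\mathbf{w},\, b} \; \mathcal{L}(\mathbf{w}, b, \lambda_1, \ldots, \lambda_l)
\qquad \text{subject to} \qquad \lambda_i \ge 0, \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \quad i = 1, \ldots, l. \tag{10}$$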


By choosing $\lambda_i \ge 0$ in Eq. (8) and putting the constraints in with a minus sign in Eq. (9), we must maximize over
the $\lambda_i$. More can be said concerning the solution of this extended problem. The Lagrange undetermined multipliers
are non-zero only when the constraint is an equality. This is because, in putting the Lagrange multipliers into the
minimization problem, they only make a contribution when a constraint equality is reached. Thus

$$\lambda_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right] = 0, \tag{11}$$

which is called a complementarity condition. Assuming that $\mathcal{L}$ is a differentiable function of $\mathbf{w}$ and $b$, we then have
the further conditions for a minimum in $\mathbf{w}$, $b$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \tag{12}$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0. \tag{13}$$

A more formal presentation is given by Fletcher [14] and a very readable recent account is given by Boyd and
Vandenberghe [15]. Eqs. (6), (8), (11), (12), (13) are called the Karush-Kuhn-Tucker (KKT) optimality conditions
and a formal derivation and interpretation can be found in these references.



Figure 2. Schematic of non-separable data in $\mathbb{R}^2$. The circles are feature vectors in class +1 and the diamonds are
feature vectors in class -1. There is one feature vector that is not separable.


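Performing the derivatives in Eqs. (12), (13) we have

$$\mathbf{w} = \sum_{i=1}^{l} \lambda_i y_i \mathbf{x}_i \tag{14}$$

$$\sum_{i=1}^{l} \lambda_i y_i = 0, \tag{15}$$

which when substituted in Eq. (9) allows us to eliminate $\mathbf{w}$ and $b$ in Eq. (10) to obtain an equivalent quadratic
optimization problem. This is called the dual optimization problem and it has simplified constraints:

$$\max_{\lambda_1, \ldots, \lambda_l} \left[ -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \lambda_i y_i (\mathbf{x}_i \cdot \mathbf{x}_j) y_j \lambda_j + \sum_{i=1}^{l} \lambda_i \right]
\qquad \text{subject to} \qquad \lambda_i \ge 0, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} \lambda_i y_i = 0. \tag{16}$$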


In the process of obtaining a solution, we expect that some of the $\lambda_i$ will be 0, and the remaining ones will be
associated with the support vectors. From the solution for the $\lambda_i$ we obtain $\mathbf{w}$ from Eq. (14), and $b$ from Eq. (11)
for any $\lambda_i > 0$. Methods of solving the optimization problem are taken up in the section on implementation.
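To make the recovery of $\mathbf{w}$ and $b$ from the multipliers concrete, the sketch below evaluates the dual objective of Eq. (16) and applies Eqs. (14) and (11) to a hypothetical two-point problem whose solution, $\lambda_1 = \lambda_2 = 1/4$, can be found by hand. The function names are introduced here for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(lam, X, y):
    """Objective of Eq. (16): -1/2 sum_ij lam_i y_i (x_i . x_j) y_j lam_j + sum_i lam_i."""
    l = len(lam)
    quad = sum(lam[i] * y[i] * dot(X[i], X[j]) * y[j] * lam[j]
               for i in range(l) for j in range(l))
    return -0.5 * quad + sum(lam)


def recover_w(lam, X, y):
    """Eq. (14): w = sum_i lam_i y_i x_i."""
    return [sum(lam[i] * y[i] * X[i][k] for i in range(len(lam)))
            for k in range(len(X[0]))]


def recover_b(lam, X, y, w):
    """Eq. (11): for a support vector, y_i (w . x_i + b) = 1, hence b = y_i - w . x_i."""
    i = next(i for i, li in enumerate(lam) if li > 0)
    return y[i] - dot(w, X[i])


# Hypothetical two-point problem; both points are support vectors.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
lam = [0.25, 0.25]

w = recover_w(lam, X, y)
b = recover_b(lam, X, y, w)
print(w, b, dual_objective(lam, X, y))
```

Here the dual objective equals the primal value $\frac{1}{2} |\mathbf{w}|^2 = 0.25$, and any other feasible choice with $\lambda_1 = \lambda_2$ gives a smaller dual value.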

2.5. Non-separable Data 


Cortes and Vapnik generalized the optimal margin methods to non-separable data [16, 17], which we discuss next.
With reference to Fig. 2, the problem is now that there is no way to place a hyperplane such that we can separate
the training data into two classes.

Cortes and Vapnik [16, 17] gave the following solution and named it the soft margin classifier. Relax the restriction
that every training vector of a given class lie on the same side of the optimal hyperplane by introducing new variables,
$\xi_i$, $i = 1, \ldots, l$, that take the values $\xi_i \ge 0$. This generalizes Eq. (6) to be

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0, \qquad i = 1, \ldots, l. \tag{17}$$

Then add a new term, $C \sum_{i=1}^{l} \xi_i$ ($C$ is a positive constant, $\infty > C > 0$), to Eq. (7) that balances the contribution of
minimizing $\frac{1}{2} |\mathbf{w}|^2$ with penalizing solutions for which the $\xi_i$ get large. The non-separable optimization problem is then

$$\min_{\mathbf{w},\, b,\, \xi_1, \ldots, \xi_l} \; \frac{1}{2} |\mathbf{w}|^2 + C \sum_{i=1}^{l} \xi_i
\qquad \text{subject to} \qquad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0, \quad \xi_i \ge 0, \quad i = 1, \ldots, l. \tag{18}$$

Note that as $C \to \infty$ the effect of any $\xi_i$ deviating from 0 is increasingly costly to the minimization. Thus in
the limit $C \to \infty$ the optimization reduces to the formulation for the separable case. Note that $C \to 0$ is not the
separable case, because then there is no effect on the cost if $\xi_i > 0$, and the optimal hyperplane would be the one
that placed itself at the midpoint of those two feature vectors, one from each class, with the largest separation.
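The role of the slack variables and of the constant $C$ can be seen by evaluating the soft margin objective for a fixed hyperplane. In the sketch below, which uses hypothetical data and function names, each $\xi_i$ is set to the smallest non-negative value that satisfies the relaxed constraint, namely $\max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i (w . x_i + b) - 1 + xi_i >= 0 for each training vector."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """Cost 1/2 |w|^2 + C sum_i xi_i of the non-separable problem for a given (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))


# Hypothetical data: the third vector carries label +1 but lies on the -1 side.
X = [[1.0, 1.0], [-1.0, -1.0], [-0.5, -0.5]]
y = [1, -1, 1]
w, b = [0.5, 0.5], 0.0

print(slacks(w, b, X, y))
print(soft_margin_objective(w, b, X, y, C=10.0))
print(soft_margin_objective(w, b, X, y, C=0.1))
```

Only the misplaced vector has non-zero slack, and its contribution to the cost grows in proportion to $C$, which is the trade-off described above. The hyperplane used here is fixed by hand and is not claimed to be the minimizer.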

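As in the separable case, a dual form can be obtained using two sets of Lagrange multipliers, $\lambda_i$, $\mu_i$, $i = 1, \ldots, l$,
to handle the two sets of constraints in Eq. (18). We obtain

$$\mathcal{L}(\mathbf{w}, b, \lambda_1, \ldots, \lambda_l, \mu_1, \ldots, \mu_l) =
\frac{|\mathbf{w}|^2}{2} + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \lambda_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} \mu_i \xi_i. \tag{19}$$

Assembling the constraint inequalities, the properties of the undetermined multipliers, and assuming $\mathcal{L}$ is a
differentiable function which we seek to minimize over $\mathbf{w}$, $b$, and $\xi_i$ for $i = 1, \ldots, l$, we have the KKT conditions for the
non-separable problem:

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0, \qquad i = 1, \ldots, l \tag{20}$$

$$\xi_i \ge 0, \qquad i = 1, \ldots, l \tag{21}$$

$$\lambda_i \ge 0, \qquad i = 1, \ldots, l \tag{22}$$

$$\mu_i \ge 0, \qquad i = 1, \ldots, l \tag{23}$$

$$\lambda_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \right] = 0, \qquad i = 1, \ldots, l \tag{24}$$

$$\mu_i \xi_i = 0, \qquad i = 1, \ldots, l \tag{25}$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \lambda_i y_i \mathbf{x}_i = 0 \tag{26}$$

$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{l} \lambda_i y_i = 0 \tag{27}$$

$$\frac{\partial \mathcal{L}}{\partial \xi_i} = C - \lambda_i - \mu_i = 0, \qquad i = 1, \ldots, l. \tag{28}$$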


Eqs. (20), (21) are the constraint inequalities, Eqs. (22), (23) are part of the definition of the Lagrange undetermined
multipliers, Eqs. (24), (25) are the complementarity conditions of the Lagrange multipliers, and Eqs. (26), (27), (28)
are minimization conditions for a differentiable function. When Eqs. (26), (27), and (28) are substituted in Eq. (19),
the terms in $\xi_i$ cancel and we obtain the dual problem

$$\max_{\lambda_1, \ldots, \lambda_l} \left[ -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \lambda_i y_i (\mathbf{x}_i \cdot \mathbf{x}_j) y_j \lambda_j + \sum_{i=1}^{l} \lambda_i \right]
\qquad \text{subject to} \qquad C \ge \lambda_i \ge 0, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} \lambda_i y_i = 0. \tag{29}$$



Note the only difference from the dual of the separable case, Eq. (16), is that the "constraint forces", $\lambda_i$, are bounded
above by $C$, reflecting the fact that the original inequality constraint, Eq. (6), holds only while $\xi_i = 0$ and then
becomes soft when $\xi_i > 0$, which implies the constraint force saturates. Thus the term soft margin for this approach
to the non-separable case has intuitive meaning. Note that Eq. (28) combined with Eq. (25) gives $\xi_i = 0$ if $\lambda_i < C$,
i.e., the constraint force is not saturated. The terms with $\lambda_i = C$ label the non-separable points.

Knowing the solutions $\lambda_i$, we can find $\mathbf{w}$ for the hyperplane using Eq. (26), and $b$ from any one or more solutions
for which $C > \lambda_i > 0$, using Eq. (24) with $\xi_i = 0$.
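For readers who want to experiment with the box-constrained dual, the following is a minimal sketch of a pairwise coordinate ascent in the spirit of sequential minimal optimization. It is not the solver used for the results in this paper, it omits the working-set heuristics that make production solvers fast, and the function and parameter names (`train_soft_margin`, `tol`, `max_sweeps`) are assumptions made for this example. Each step changes two multipliers so that $\sum_i \lambda_i y_i = 0$ and $0 \le \lambda_i \le C$ continue to hold.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_soft_margin(X, y, C, kernel=linear_kernel, tol=1e-8, max_sweeps=100):
    """Pairwise ascent on the dual with 0 <= lam_i <= C and sum_i lam_i y_i = 0."""
    l = len(X)
    K = [[kernel(X[i], X[j]) for j in range(l)] for i in range(l)]
    lam, b = [0.0] * l, 0.0

    def error(i):
        # Current decision value at x_i minus its label.
        return sum(lam[k] * y[k] * K[k][i] for k in range(l)) + b - y[i]

    for _ in range(max_sweeps):
        changed = False
        for i in range(l):
            for j in range(i + 1, l):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= 0.0:
                    continue
                Ei, Ej = error(i), error(j)
                # Interval for lam_j that keeps both multipliers inside [0, C].
                if y[i] != y[j]:
                    lo, hi = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
                else:
                    lo, hi = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
                new_j = min(hi, max(lo, lam[j] + y[j] * (Ei - Ej) / eta))
                if abs(new_j - lam[j]) < tol:
                    continue
                # Compensating change in lam_i preserves sum_i lam_i y_i.
                new_i = lam[i] + y[i] * y[j] * (lam[j] - new_j)
                di, dj = new_i - lam[i], new_j - lam[j]
                b1 = b - Ei - y[i] * di * K[i][i] - y[j] * dj * K[i][j]
                b2 = b - Ej - y[i] * di * K[i][j] - y[j] * dj * K[j][j]
                if 0.0 < new_i < C:
                    b = b1
                elif 0.0 < new_j < C:
                    b = b2
                else:
                    b = 0.5 * (b1 + b2)
                lam[i], lam[j] = new_i, new_j
                changed = True
        if not changed:
            break
    return lam, b


# Hypothetical two-point problem, solved with a large and a small value of C.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
lam_hard, b_hard = train_soft_margin(X, y, C=10.0)
lam_soft, b_soft = train_soft_margin(X, y, C=0.1)
print(lam_hard, b_hard)
print(lam_soft, b_soft)
```

With the large value of $C$ the multipliers reach the separable solution $\lambda_1 = \lambda_2 = 1/4$; with the small value both multipliers saturate at the bound $C$, which illustrates the saturation of the constraint force.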


2.6. Kernel Method 


Up to this point we have only dealt with classification as a linear function of the training data: the decision surface
is a hyperplane defined by linear equations on the training data. However, it can be the case that no hyperplane
exists to separate the data. The non-separable method provides one way to deal with this. As an alternative we
would like a way to build a non-linear decision surface. An extremely useful generalization which can give non-linear
decision surfaces and improved separation of the training data is possible using the following idea, first introduced
by Aizerman, Braverman, and Rozonoer [18] and incorporated into machine learning as part of the Support Vector
Machine by Boser, Guyon, and Vapnik [1].


Note that the training data enters the optimization problems, Eqs. (16), (29), only as dot products. Suppose
that we map the feature vectors, $\mathbf{x} \in \mathbb{R}^n$, into a higher dimensional Euclidean space, $\mathcal{H}$, by means of a non-linear
vector function $\Phi : \mathbb{R}^n \to \mathcal{H}$. Then we may again pose the optimal margin problem in the space $\mathcal{H}$ by replacing
$\mathbf{x}_i \cdot \mathbf{x}_j$ by $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. Then, as before, solve the optimization problem for the $\lambda_i$. This finds the support vectors
among the transformed vectors, $\Phi(\mathbf{x}_i)$, by association with the $\lambda_i > 0$. We then use these to build the classifier
function:

$$f(\mathbf{x}; \lambda_1, \ldots, \lambda_l) = \mathrm{sgn}\left( \sum_{i=1}^{l} \lambda_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}) + b \right). \tag{30}$$


Now suppose there exists a kernel function $K$ such that

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j), \tag{31}$$

then everywhere $\mathbf{x}_i \cdot \mathbf{x}_j$ occurred, we could replace it with $K(\mathbf{x}_i, \mathbf{x}_j)$. We need not explicitly compute $\Phi(\mathbf{x})$, which
could be computationally expensive, but only need compute the kernel functions. In fact we need not have an explicit
representation of $\Phi$ at all, but only $K$. The restrictions on what functions can qualify as kernel functions are discussed
in Burges [8].
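The classifier of Eq. (30), with the dot product in the transformed space replaced by a kernel as in Eq. (31), can be sketched as follows. The function name `kernel_classify`, the treatment of a decision value of exactly 0 as +1, and the example multipliers are assumptions made for illustration; a linear kernel is used so the result can be checked by hand.

```python
def kernel_classify(x, vectors, labels, lam, b, kernel):
    """Eq. (30) with Phi(x_i) . Phi(x) replaced by K(x_i, x), as allowed by Eq. (31)."""
    s = sum(l_i * y_i * kernel(x_i, x)
            for x_i, y_i, l_i in zip(vectors, labels, lam) if l_i > 0)
    return 1 if s + b >= 0 else -1


def linear_kernel(u, v):
    return sum(a * c for a, c in zip(u, v))


# Hypothetical solved two-point problem: both training vectors are support vectors.
vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
lam, b = [0.25, 0.25], 0.0

print(kernel_classify([2.0, 0.5], vectors, labels, lam, b, linear_kernel))
print(kernel_classify([-3.0, 1.0], vectors, labels, lam, b, linear_kernel))
```

Only the training vectors with $\lambda_i > 0$ contribute to the sum, so the cost of classifying a new vector scales with the number of support vectors rather than with the size of the training set.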

What is gained is that we have moved the data into a larger space where the training data may be spread further
apart and a larger margin may be found for the optimal hyperplane. In the cases where we can explicitly find $\Phi$,
we can use the inverse of $\Phi$ to construct the non-linear separator in the original space. Clearly there is a lot of
freedom in choosing the kernel function and recent work has gone into the study of this idea both for SVM and for
other problems [19].

With respect to the curse of dimensionality, we never explicitly work in the higher dimensional space, so we are 
never confronted with computing the large number of vector components in that space. 

For the results presented below, we have used the inhomogeneous polynomial kernel function

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^d, \tag{32}$$

with $d = 7$, though we found little difference in our results for $d = 2, \ldots, 6$. The choice of the inhomogeneous
polynomial kernel is based on other workers' success using this kernel function in solving the handwritten digit
problem [1].
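The statement that the kernel stands in for a dot product in a higher dimensional space can be verified directly in a small case. The sketch below compares the polynomial kernel of Eq. (32) for $d = 2$ and two-dimensional input with an explicit feature map; `phi_degree2` is a name introduced here, and the map shown is one valid choice rather than the only one.

```python
import math


def poly_kernel(x, y, d=7):
    """Inhomogeneous polynomial kernel of Eq. (32): (x . y + 1)^d."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** d


def phi_degree2(x):
    """Explicit map for d = 2, n = 2 with phi(x) . phi(y) = (x . y + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]


x, y = [1.0, 2.0], [3.0, -1.0]
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(y)))
print(poly_kernel(x, y, d=2), explicit)

# Number of monomials of degree at most d in n variables: C(n + d, d).
print(math.comb(2 + 2, 2))    # matches the six components of phi_degree2
print(math.comb(200 + 7, 7))  # same count for 200 bands and d = 7
```

The count of monomials of degree at most $d$ in $n$ variables, $\binom{n+d}{d}$, gives the dimension of such an explicit map. For 200 bands and $d = 7$ this number is very large, while the kernel itself costs one 200-term dot product and one power.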

In fact there are principled ways to choose among kernel functions and to choose the parameters of the kernel
function. Vapnik [2] has pioneered a body of results from probability theory that provide a principled way to approach
these questions in the context of the Support Vector Machine.



Class Name         Number of Ground    Number of           Number of
                   Truth Vectors       Training Vectors    Test Vectors
Corn-notill        1008                201                 807
Soybean-notill     727                 145                 582
Soybean-mintill    1926                385                 1541
Grass-Trees        732                 146                 586

Table 1. Data description of the Indian Pines subset scene.


2.7. Multi-Class Classifiers 

Two simple ways to generalize a binary classifier to a classifier for $K$ classes are:

1. Train $K$ binary classifiers, each one using training data from one of the $K$ classes and training data from all
the remaining $K - 1$ classes. Apply all $K$ classifiers to each vector of the test data, and select the label of the
classifier with the largest margin, i.e., the value of the argument of the sgn function in Eq. (30).

2. Train $\binom{K}{2} = K(K-1)/2$ binary classifiers on all pairs of training data. Apply all $K(K-1)/2$ binary
classifiers to each vector of the test data and for each outcome give one vote to the winning class. Select the
label of the class with the most votes. For a tie, apply a tie breaking strategy.

We chose the second approach; though it requires building more classifiers, it keeps the size of the training data
smaller and is faster for training.
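The pairwise voting scheme can be sketched as below. The binary classifiers here are hypothetical stand-ins (nearest of two one-dimensional class centers) so that the example is self-contained; in practice each would be a trained SVM. The name `one_vs_one_predict` and the convention that a pair classifier returns +1 for the first class of the pair are assumptions, and ties are broken at random.

```python
import random
from itertools import combinations


def one_vs_one_predict(x, classes, pair_classifiers, rng=None):
    """Each of the K(K-1)/2 classifiers gives one vote; ties are broken at random."""
    rng = rng or random.Random(0)
    votes = {c: 0 for c in classes}
    for first, second in combinations(classes, 2):
        winner = first if pair_classifiers[(first, second)](x) > 0 else second
        votes[winner] += 1
    best = max(votes.values())
    return rng.choice([c for c in classes if votes[c] == best])


# Hypothetical stand-in classifiers: vote for the nearer of two class centers.
classes = ["corn", "soy", "grass"]
centers = {"corn": 0.0, "soy": 1.0, "grass": 2.0}


def nearer(first, second):
    return lambda x: 1 if abs(x[0] - centers[first]) <= abs(x[0] - centers[second]) else -1


pair_classifiers = {pair: nearer(*pair) for pair in combinations(classes, 2)}

print(one_vs_one_predict([0.9], classes, pair_classifiers))
print(len(list(combinations(range(4), 2))), len(list(combinations(range(16), 2))))
```

The last line prints the number of pairwise classifiers needed for 4 and for 16 classes. Each pairwise classifier is trained only on the data of its two classes, which is what keeps the individual training problems small.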


3. HYPERSPECTRAL DATA 

In this work, hyperspectral data was obtained from the AVIRIS imaging spectrometer, which has (on the ER-2
aircraft) a ground pixel size of 17 m x 17 m and 224 spectral channels covering the range from 400
nm to 2500 nm, centered at 10 nm intervals. We focus on part of the data taken on June 12, 1992 in the northern
part of Indiana, U.S. This data [20] has been studied by D. Landgrebe and students, and his website has a companion
paper [21] describing its analysis by a free software package [22]. The data consists of 145 x 145 pixels by 220 bands and
has been approximately converted from the radiance measured at the sensor to the reflectance, which is an intrinsic
property of the surface [23]. Data from bands where there is a large amount of water absorption in the atmosphere
have been replaced by a constant value.

The scene consists of about two-thirds agriculture, and one-third forest or other natural perennial vegetation. 
There are two major dual lane highways, a rail line, as well as some low density housing, other built structures, 
and smaller roads. Since the scene is taken in June some of the crops present, corn, soybeans, are in early stages 
of growth with less than 5% coverage. The ground truth available is designated into sixteen classes and is not all 
mutually exclusive. 

In order to compare to the recent results of S. Tadjudin and D. Landgrebe [24, 25], we have studied two scenes also
used in their work.

1. A part of the 145 x 145 scene, called the subset scene, consisting of pixels [27 - 94] x [31 - 116] for a size of
68 x 86. [Upper left in the original scene is at (1, 1)]. There is ground truth for over 75% of this scene and it is
comprised of the three row crops, Corn-notill, Soybean-notill, Soybean-mintill, and Grass-Trees. Table 1 gives
further details.

2. The full 145 x 145 scene, for which there is ground truth covering 49% of the scene, divided among
16 classes ranging in size from 20 pixels to 2468 pixels.





Figure 3. Training data for classification in the subset scene. The data has been centered. 


Following Tadjudin and Landgrebe's work, we have also reduced the number of bands to 200 by removing bands
covering the region of water absorption: [104 - 108], [150 - 163], and 220. For each band at each pixel in the subset scene,
the data was rescaled from the input two byte short integer by dividing by 10000 to make a floating point number
in the range [0,1].
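The band selection and rescaling can be written out as a short sketch. It assumes that bands are numbered from 1 and that a pixel is given as a list of 220 raw integers in band order; neither assumption is stated in the text, and the names `KEPT_BANDS` and `rescale` are introduced here for illustration.

```python
# Band numbers removed because of water absorption (assumed to be 1-based).
REMOVED_BANDS = set(range(104, 109)) | set(range(150, 164)) | {220}
KEPT_BANDS = [band for band in range(1, 221) if band not in REMOVED_BANDS]


def rescale(raw_pixel):
    """Keep the 200 retained bands and divide the raw integers by 10000."""
    return [raw_pixel[band - 1] / 10000.0 for band in KEPT_BANDS]


# Hypothetical raw pixel with the same value in every band.
pixel = rescale([5000] * 220)
print(len(KEPT_BANDS), len(pixel), pixel[0])
```

Removing 5 + 14 + 1 = 20 bands from the 220 leaves the 200 bands used for classification.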

Then the data was centered, which means that for the whole scene, for each band, the mean was found and this 
was subtracted from all the data in that band. This distributes the data around 0 and considerably speeds up the 
optimization routines. Fig. 3 shows this centered data for the four classes. Note that there is substantial overlap 
between Corn-notill, Soybean-notill, and Soybean-mintill, while Grass-Trees is well separated from the other three 
classes. 
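The centering step amounts to subtracting a per-band mean computed over the whole scene. A minimal sketch, with a hypothetical function name and a two-band toy input, is:

```python
def center_bands(pixels):
    """Subtract from every band its mean over all pixels; returns (centered, means)."""
    n = len(pixels)
    means = [sum(band_values) / n for band_values in zip(*pixels)]
    centered = [[v - m for v, m in zip(pixel, means)] for pixel in pixels]
    return centered, means


# Hypothetical scene of two pixels with two bands each.
centered, means = center_bands([[1.0, 2.0], [3.0, 6.0]])
print(means)
print(centered)
```

After centering, every band sums to zero over the scene. The same means would need to be subtracted from any vector classified later so that training and test data share one origin; that is a general remark about centering, not a detail taken from the text.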


4. APPLICATION OF SVM TO HYPERSPECTRAL DATA AND RESULTS 
4.1. SVM Implementation 

We have implemented the Support Vector Machine first by building on free Matlab software available from S. Gunn [9]
and de Ridder [11]. For the separable case, Gunn has shown that the problem may be recast as non-linear least
squares and he has provided a routine to perform this. However, for the non-separable case, quadratic optimization is
required. We first used the quadratic optimizer package available from Matlab, which can be slow for larger data sets.
Subsequently we have adapted the software package from T. Joachims [26], which works with A. Smola's optimization
code [27], and which takes advantage of the KKT conditions in the non-separable case to provide good performance
for large training sets.


4.2. Classifier Results for the Subset Scene 

From the subset scene, a random sample of 20% of the pixels was chosen from the known ground truth of the four 
classes: Corn-notill, Soybean-notill, Soybean-mintill, Grass-Trees. This was used to train six binary classifiers, one 
for each pair of classes. The trained classifiers were then applied to the remaining 80% of the known ground pixels 
in the scene, with the voting strategy above. Ties were broken by a random choice. 
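A random 20%/80% split of the ground truth can be sketched as follows. Drawing the sample separately within each class and rounding the training count down are assumptions; they are chosen because they reproduce the training and test counts listed in Table 1. The function name and the use of integer pixel identifiers are hypothetical.

```python
import random


def split_20_80(pixels_by_class, seed=0):
    """Randomly assign one fifth of each class (rounded down) to training, the rest to test."""
    rng = random.Random(seed)
    train, test = {}, {}
    for name, pixels in pixels_by_class.items():
        shuffled = list(pixels)
        rng.shuffle(shuffled)
        n_train = len(shuffled) // 5  # 20%, rounded down (assumption matching Table 1)
        train[name] = shuffled[:n_train]
        test[name] = shuffled[n_train:]
    return train, test


# Ground truth counts from Table 1; integer ids stand in for pixels.
ground_truth = {
    "Corn-notill": range(1008),
    "Soybean-notill": range(727),
    "Soybean-mintill": range(1926),
    "Grass-Trees": range(732),
}
train, test = split_20_80(ground_truth, seed=1)
print({name: (len(train[name]), len(test[name])) for name in ground_truth})
```

Changing `seed` gives a different random selection, which is how repeated trials with different training samples could be generated.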

This procedure was repeated in five trials using a different random seed for the selection of the 20% of the
data used for training. Table 2 shows the contingency table for a typical trial. For a trial, the overall performance is
the sum of the number of samples correctly labeled for each class in the test set divided by the total number of samples
in the test set. Table 3 summarizes the five trials. Note that Grass-Trees was classified almost completely correctly as
might be expected from the lack of overlap in the training data.

Class Name         Percent    Number of    Corn-     Soybean-    Soybean-    Grass-
                   Correct    Samples      notill    notill      mintill     Trees
Corn-notill        94.3       807          761       4           38          4
Soybean-notill     95.7       582          1         557         23          1
Soybean-mintill    96.1       1541         39        21          1481        0
Grass-Trees        100        586          0         0           0           586
TOTAL              96.3       3516         801       582         1542        591

Table 2. A typical result for the Indian Pines subset scene. The entries in rows 2-5 and columns 4-7 are the
contingency table for results for the subset scene. For each horizontal line labeled on the left by class name A, the
entries under the four class names $B_1$, $B_2$, $B_3$, $B_4$ give the distribution of all the testing ground truth pixels of
class A into the four classes. The diagonal entries are the correctly classified samples. A perfect result would have all
zeros except on the diagonal. The overall performance is computed by the ratio of the sum of the diagonal elements to
the sum of all entries of the contingency table.

Trial      Overall        Class correct (%)
           correct (%)    Corn-notill    Grass-Trees    Soybean-notill    Soybean-mintill
1          96.3           94.3           100.0          96.1              95.7
2          95.8           92.8           99.8           95.7              96.0
3          96.1           95.2           99.8           95.7              94.7
4          95.5           94.7           100.0          95.1              93.5
5          95.6           95.7           99.8           94.8              93.3
Average    95.9           94.5           99.9           95.5              94.6

Table 3. Summary of trials on SVM classifier for the Indian Pines subset scene.
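The performance figures can be recomputed from the contingency table. The sketch below uses the counts reported in Table 2 and the overall percentages reported in Table 3; only the function names are introduced here.

```python
def per_class_percent(table):
    """Percent of each true class (row) that was assigned to the correct class (diagonal)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(table)]


def overall_percent(table):
    """Sum of the diagonal divided by the sum of all entries, in percent."""
    correct = sum(table[i][i] for i in range(len(table)))
    return 100.0 * correct / sum(sum(row) for row in table)


# Rows: true class; columns: assigned class, in the order
# Corn-notill, Soybean-notill, Soybean-mintill, Grass-Trees (counts from Table 2).
table2 = [
    [761, 4, 38, 4],
    [1, 557, 23, 1],
    [39, 21, 1481, 0],
    [0, 0, 0, 586],
]
print([round(p, 1) for p in per_class_percent(table2)])
print(round(overall_percent(table2), 1))

# Overall percentages of the five trials (from Table 3) and their mean.
trials = [96.3, 95.8, 96.1, 95.5, 95.6]
print(round(sum(trials) / len(trials), 1))
```

The recomputed values agree with the Percent Correct column of Table 2 and with the average overall performance in Table 3.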

The results across the five trials were consistent within one percent and the average performance was 96%,
which is somewhat better than the 93% from recent results of Tadjudin and Landgrebe [24] for their best classifier,
bLOOC+DAFE+ECHO, on the same data. Table 4 summarizes the results for the subset scene comparing the
SVM classifier, bLOOC+DAFE+ECHO, and a simple Euclidean classifier [24]. The Euclidean classifier uses only the
first order statistics of the training data. Its poor performance is expected for this data due to the overlap of the
classes. The details of the bLOOC+DAFE+ECHO classifier are covered in Tadjudin and Landgrebe [24].


METHOD                    PERFORMANCE
                          Subset Scene    Full Scene
Support Vector Machine    95.9%           87.3%
bLOOC+DAFE+ECHO           93.5%           82.9%
Euclidean                 66.7%           48.2%

Table 4. A comparison of results for the Indian Pines subset scene (68 x 86 pixels) and the full scene (145 x 145
pixels). The results labeled bLOOC+DAFE+ECHO and Euclidean are taken from the recent work of Tadjudin and
Landgrebe [24, 25] and represent the best classifier results reported for this scene in that work. Also note that their
results for the full scene are for 17 classes compared to our 16. The difference is explained in the text. All training
is based on 20% of the ground truth and testing on the remaining 80%.






4.3. Classifier Results for the Full Scene 

Results for the full scene were produced using only one trial. Here we used the sixteen ground truth classes given
in Landgrebe's data [20]. We made a random selection of 20% of the ground truth data for training and tested on the
remaining 80%. A difference with the data and results reported by Tadjudin and Landgrebe [24, 25] is that they studied
the scene using 17 classes whereas we only used 16. The difference is that they further resolved the class Soybeans-notill
into two subclasses based on fields that were in different locations in the full scene. The results
reported in Table 4 show the Support Vector Machine to be somewhat better, although the difference in the number
of classes may have some effect.


5. CONCLUSIONS 

We have described a new approach to building a supervised learning machine called the Support Vector Machine, and
applied it to classify hyperspectral remote sensing data. The inherent high dimensionality of this data is challenging
for traditional classifiers, due to the Hughes effect, and usually a feature selection preprocessing step is performed
first to reduce the dimensionality of the data. The Support Vector Machine does not suffer from this handicap, and
is thus suitable for use with hyperspectral data. The results we have obtained show it to be competitive with other
recently developed classifiers for hyperspectral data when applied to the same data sets.

In this work the choice of kernel function is ad hoc, as are the choices of the kernel function parameter, $d$,
and the separability parameter, $C$. However, the Support Vector Machine can be placed into the Structural Risk
Minimization approach of Vapnik [2], and using bounds from recent results from probability theory, a more
rigorous approach can be taken to choosing these parameters.

Also we note that all the results we have shown are completely in the spectral domain and no aspect of the
spectral coherence of the data has been used. The results would be identical if all the classifier bands were permuted
consistently throughout the data. In addition, we have not utilized the spatial coherence of the data. We note recent
studies on the classification of handwritten digits [28] show that performance gains can be made by incorporating
prior knowledge into the construction of the Support Vector Machine and we believe similar gains can be made for
classifying hyperspectral data using the coherence in the data.
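The remark about permuting bands follows from the fact that the polynomial kernel depends on the data only through dot products, which do not change when the components of both vectors are reordered in the same way. The sketch below checks this on hypothetical random vectors of 200 components; the value range and the seed are arbitrary choices made for the example.

```python
import math
import random


def poly_kernel(x, y, d=7):
    """Inhomogeneous polynomial kernel (x . y + 1)^d."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** d


rng = random.Random(0)
x = [rng.uniform(-0.05, 0.05) for _ in range(200)]
y = [rng.uniform(-0.05, 0.05) for _ in range(200)]

# Apply one and the same band permutation to both vectors.
perm = list(range(200))
rng.shuffle(perm)
x_perm = [x[k] for k in perm]
y_perm = [y[k] for k in perm]

print(math.isclose(poly_kernel(x, y), poly_kernel(x_perm, y_perm), rel_tol=1e-9))
```

Because every kernel value is unchanged up to floating point rounding, the dual problem and the resulting classifier are the same for the permuted data, which is why the ordering of the bands carries no information for the classifier as used here.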